Bad Global Minima Exist and SGD Can Reach Them

Neural Information Processing Systems

Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image classification with common deep neural network architectures. We find that if we do not regularize explicitly, then SGD can easily be made to converge to poorly-generalizing, high-complexity models: all it takes is to first train on a random labeling of the data, before switching to properly training with the correct labels. In contrast, we find that in the presence of explicit regularization, pretraining with random labels has no detrimental effect on SGD. We believe that our results give evidence that explicit regularization plays a far more important role in the success of overparameterized neural networks than has been understood until now. Specifically, in suppressing complicated models that got lucky with the training data, regularization not only makes simple models that fit the data well the global optima, but it also clears the way to make them discoverable by local methods, such as SGD.
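The two-phase procedure described in the abstract (train on a random labeling first, then switch to the correct labels) can be sketched roughly as follows. This is a minimal illustration only, not the authors' experimental setup: the one-hidden-layer NumPy network, the synthetic data, the hyperparameters, the use of L2 weight decay as a stand-in for "explicit regularization", and all function names (`init_params`, `sgd_train`, `pretrain_then_train`) are assumptions made for the example.

```python
import numpy as np

def init_params(d_in, d_hidden, n_classes, rng):
    # Hypothetical toy model: one ReLU hidden layer.
    return {
        "W1": rng.normal(0, 1 / np.sqrt(d_in), (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0, 1 / np.sqrt(d_hidden), (d_hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }

def forward(params, X):
    H = np.maximum(X @ params["W1"] + params["b1"], 0.0)  # ReLU activations
    return H, H @ params["W2"] + params["b2"]

def sgd_train(params, X, y, epochs, rng, lr=0.05, batch=16, weight_decay=0.0):
    """Minibatch SGD on mean softmax cross-entropy.

    weight_decay is an L2 coefficient applied to every parameter
    (biases included, for brevity); 0.0 means no explicit regularization.
    """
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            xb, yb = X[idx], y[idx]
            H, logits = forward(params, xb)
            logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
            probs = np.exp(logits)
            probs /= probs.sum(axis=1, keepdims=True)
            probs[np.arange(len(idx)), yb] -= 1.0
            dlogits = probs / len(idx)          # gradient of mean loss w.r.t. logits
            dH = dlogits @ params["W2"].T
            dH[H <= 0] = 0.0                    # backprop through ReLU
            grads = {
                "W1": xb.T @ dH, "b1": dH.sum(axis=0),
                "W2": H.T @ dlogits, "b2": dlogits.sum(axis=0),
            }
            for k in params:
                params[k] -= lr * (grads[k] + weight_decay * params[k])
    return params

def accuracy(params, X, y):
    return float((forward(params, X)[1].argmax(axis=1) == y).mean())

def pretrain_then_train(X, y, n_classes, rng, pre_epochs, epochs,
                        d_hidden=64, weight_decay=0.0):
    params = init_params(X.shape[1], d_hidden, n_classes, rng)
    # Phase 1: labels drawn independently of the inputs.
    y_random = rng.integers(0, n_classes, size=y.shape[0])
    sgd_train(params, X, y_random, pre_epochs, rng, weight_decay=weight_decay)
    # Phase 2: continue from the same weights with the correct labels.
    sgd_train(params, X, y, epochs, rng, weight_decay=weight_decay)
    return params

# Synthetic stand-in data (an assumption; the paper uses image datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

params_plain = pretrain_then_train(X, y, 2, rng, pre_epochs=5, epochs=5)
params_decay = pretrain_then_train(X, y, 2, rng, pre_epochs=5, epochs=5,
                                   weight_decay=1e-3)
acc_plain = accuracy(params_plain, X, y)
acc_decay = accuracy(params_decay, X, y)
print("train accuracy, no weight decay:", acc_plain)
print("train accuracy, weight decay:  ", acc_decay)
```

A toy of this size is not expected to reproduce the generalization gap reported in the abstract; the sketch only shows the order of the two training phases and where an explicit regularizer would enter the update.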


Review for NeurIPS paper: Bad Global Minima Exist and SGD Can Reach Them

Neural Information Processing Systems

Weaknesses: - The paper claims to have shown for the first time that models that perfectly fit the training set can have different degrees of generalization depending on the initialization. This has been shown previously, also using a similar technique. See, for example, "Theoretical issues in deep networks" by Poggio et al. (in PNAS), which shows (among other things) that, depending on the standard deviation of the distribution used to initialize the weights, the network converges to global minima with different test accuracy (see Fig. 2). Also, "Classical Generalization Bounds Are Surprisingly Tight For Deep Networks" by Liao et al. (CBMM Memo) introduces the training pipeline "Random initialization → Training with random labels → Training with true labels", and goes further: they show that the test accuracy after training with the true labels varies with the number of images whose labels were randomized (see Section 2). Fig. 2 and 3) and tables are hard to quickly extract conclusions (Table 1).
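The review mentions varying the number of images whose labels are randomized before the final training on true labels. A minimal sketch of such a label-corruption step is given below. The helper name `corrupt_labels`, its signature, and the choice to redraw labels uniformly over all classes are assumptions for illustration and are not taken from either cited work.

```python
import numpy as np

def corrupt_labels(y, fraction, n_classes, rng):
    """Return a copy of y in which `fraction` of the entries are redrawn
    uniformly at random from {0, ..., n_classes - 1}.

    A redrawn label may coincide with the original one, so the number of
    labels that actually change can be smaller than the number redrawn.
    """
    y = np.asarray(y).copy()
    n_corrupt = int(round(fraction * y.shape[0]))
    idx = rng.choice(y.shape[0], size=n_corrupt, replace=False)
    y[idx] = rng.integers(0, n_classes, size=n_corrupt)
    return y

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=1000)          # hypothetical 10-class labels
y_half = corrupt_labels(y_true, 0.5, 10, rng)    # half of the labels redrawn
print("labels changed:", int((y_half != y_true).sum()))
```

Sweeping `fraction` from 0 to 1 and using the result as the pretraining labels gives a family of starting points between ordinary training and fully random-label pretraining.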

